The self-organization into cooperative regimes is studied in a simple mean-field version of a model
based on "selfish" agents which play the Prisoner's Dilemma (PD) game. The agents have no memory
and use strategies based neither on direct reciprocity nor on 'tags'. Two variables are assigned to
each agent i at time t, measuring its capital C(i;t) and its probability of cooperation p(i;t). At
each time step t a pair of agents interact by playing the PD game. These two agents update their
probability of cooperation p(i;t) as follows: they compare the profit they made in this interaction,
$\delta C(i;t)$, with an estimator e(i;t) and, if $\delta C(i;t) > e(i;t)$, agent i increases its
p(i;t), while if $\delta C(i;t) < e(i;t)$ the agent decreases p(i;t). The 4! = 24 different cases
produced by permuting the four canonical Prisoner's Dilemma payoffs 3, 0, 5 and 1 (corresponding,
respectively, to R (reward), S (sucker's payoff), T (temptation to defect) and P (punishment)) are
analyzed. It turns out that for all these 24 possibilities, after a transient, the system
self-organizes into a stationary state with average equilibrium probability of cooperation
$\bar{p}_\infty$ = constant > 0. Depending on the payoff matrix, there are different equilibrium
states characterized by their average probability of cooperation and average equilibrium
per-capita income ($\bar{p}_\infty$, $\delta\bar{C}_\infty$).
I. INTRODUCTION

A common approach to the problem of how cooperation emerges in societies of "selfish" individuals
(individuals who pursue exclusively their own self-benefit) is based on game theory, and
specifically on the Prisoner's Dilemma (PD) of the early fifties. In a series of works Robert
Axelrod and co-workers [1] used this kind of computer game to examine the basis of cooperation
between selfish agents in a wide variety of contexts. Mechanisms of cooperation based on the PD have
shown their usefulness in economics [2]-[8], political science [9]-[11], international relations
theory [12]-[15], theoretical biology [16]-[18], ecosystems [19]-[20], etc.

The beauty of the PD game lies in the fact that it embodies the central ingredients of the
cooperation problem in a very simple and intuitive way. There are two players, each confronting two
choices, cooperate (C) or defect (D), and each makes its choice without knowing what the other will
do. Independently of what the other player does, defection D yields a higher payoff than
cooperation and is the dominant strategy. In other words, the outcome (D,D) is the Nash
equilibrium [21]. The dilemma is that if both defect, both do worse than if both had cooperated.

The emergence of cooperation in prisoner's dilemma (PD) games is generally assumed to require 
repeated play (and strategies such as Tit for Tat (TFT) [1], involving memory of previous interactions) 
or features ("tags") permitting cooperators and defectors to distinguish one another [22]. 

In this work, I consider a simple model of selfish agents playing PD, possessing neither memory nor
tags, to study the self-organized cooperative states which emerge for different payoff matrices. The
model consists of $N_{ag}$ agents, with two variables assigned to each agent at the site or cell i
and at time t: its probability of cooperation p(i;t) and its capital C(i;t). Pairs of agents, i and
j, interact by playing the PD game at each time step t. That is, there are 4 possible outcomes for
the interaction of agent i with agent j: 1) they can both cooperate (C,C), 2) both defect (D,D),
3) i cooperates and j defects (C,D) and 4) i defects and j cooperates (D,C). Depending on the
situation 1)-4), the agent i (j) gets respectively: the "reward" R (R), the "punishment" P (P), the
"sucker's payoff" S (the "temptation to defect" T) or T (S), i.e. the payoff matrix $M^{RSTP}$,
with rows and columns ordered as (C, D), is

$$M^{RSTP} = \begin{pmatrix} (R,R) & (S,T) \\ (T,S) & (P,P) \end{pmatrix}.$$

The payoff matrix gives the payoffs for ROW actions when confronted with COLUMN actions.

After playing the PD the agents update their probabilities of cooperation p(i;t) and p(j;t)
according to the same definite "measure of success", which does not vary with time. Thus all agents
follow a universal and invariant strategy defined by a measure of success plus an updating rule
that transforms p(i;t) and p(j;t) into p(i;t+1) and p(j;t+1).

The 4! = 24 different payoff matrices produced by permutation of the four canonical Prisoner's
Dilemma payoffs (3, 0, 1 and 5) are analyzed by means of a Mean Field (MF) approach, in which all
the spatial correlations in the system are neglected. It turns out that for all these 24
possibilities, after a transient, the system self-organizes into a state of equilibrium
characterized by the average probability of cooperation and average per-capita income
($\bar{p}_\infty$, $\delta\bar{C}_\infty$), always with $\bar{p}_\infty > 0$. Furthermore, in the
majority of cases $\bar{p}_\infty > 0.5$.

Payoff matrices can be classified into sub-categories according to their dominant strategy. Let us
call $M_D$ the class of those matrices such that

$$T > R \quad \text{and} \quad P > S, \tag{1}$$

for which the dominant strategy is D. This class comprises six matrices: $M^{3051}$, $M^{1053}$,
$M^{1035}$, $M^{0315}$, $M^{0153}$ and $M^{0135}$. A second class $M_C$ corresponds to

$$R > T \quad \text{and} \quad S > P, \tag{2}$$

for which the dominant strategy is C and which comprises the following six matrices: $M^{5310}$,
$M^{5301}$, $M^{5130}$, $M^{3510}$, $M^{3501}$ and $M^{1503}$. The remaining twelve matrices comply
with neither condition (1) nor condition (2) and produce situations not dominated by (D,D) or (C,C).

The only payoff matrix that implies a dilemma, in the sense explained above, is the canonical one
with R = 3, S = 0, T = 5 and P = 1, which belongs to class $M_D$ and complies with condition (1)
plus the condition R > P, or equivalently with the chain of inequalities T > R > P > S (see
footnote 1). However, some matrices exhibit a tension between C and D and give rise to
$\bar{p}_\infty \approx 1/2$. The matrices which do not embody such a trade-off produce situations
which depart from $\bar{p}_\infty \approx 1/2$. Clearly, these payoff matrices are unrealistic as
models of the social behavior of the majority of individuals. So why bother to study matrices which
imply no dilemma? One reason is that they could be of importance in other contexts: one might
envisage situations in which a definite value of $\bar{p}_\infty$ is required in the design of a
system, or is the one which optimizes the functioning of a particular mechanism. Another motivation
is that these "unreasonable" payoff matrices can be used by minorities of individuals who depart
from the "normal" ones (assumed to be neutral), for instance D-inclined "free riders" or C-inclined
"altruistic" individuals. Finally, we will show results for these payoff matrices which, at first
glance, defy intuition: for example, payoff matrices which one would bet favor cooperation but
which in fact produce a very low degree of cooperation.

Footnote 1: In fact there is an "anti-dilemma" posed by matrix $M^{1503}$, for which
T < R < P < S: although the dominant strategy is C, both players would prefer the punishment P
associated with (D,D).
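The classification by conditions (1) and (2) is easy to reproduce by brute force. The following is a minimal Python sketch, not code from this work; the function name `dominant_strategy` and the four-digit labels (R, S, T, P written in that order) are illustrative choices.

```python
from itertools import permutations

def dominant_strategy(R, S, T, P):
    """Return 'D' if condition (1) holds, 'C' if condition (2) holds, else None."""
    if T > R and P > S:
        return "D"
    if R > T and S > P:
        return "C"
    return None

# All 4! = 24 assignments of the canonical values to (R, S, T, P).
classes = {"D": [], "C": [], None: []}
for R, S, T, P in permutations((0, 1, 3, 5)):
    classes[dominant_strategy(R, S, T, P)].append(f"{R}{S}{T}{P}")

for label, members in classes.items():
    print(label, len(members), sorted(members))
```

Each class comes out with six members, and the twelve remaining matrices satisfy neither condition.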
II. THE MODEL



The pairs of interacting partners, by virtue of the MF treatment, are chosen randomly instead of
being restricted to some neighborhood. The implicit assumptions are that the population is
sufficiently large and that the system connectivity is high, i.e. the agents display high mobility
or they interact at a distance (for instance, through electronic transactions). In this work the
population of agents is fixed at $N_{ag} = 1000$ and the number of time steps $t_f$ is of order
$10^5$ to $10^6$, in such a way that both assumptions are also consistent with the fact that agents
have no memory.

Starting from an initial state at t = 0, taken as p(i;0) chosen at random in the interval [0,1] and
C(i;0) = 0 for each cell i, the system evolves by iteration during $t_f$ time steps following these
stages in this order:

1. Selection of players: At each time step t two agents, located at random positions i and j, are
selected to interact, i.e. to play the PD game.

2. Playing pairwise PD: The action, C or D, of each interacting agent k (k = i or k = j) is decided
by generating a random number r: if p(k;t) > r the agent cooperates and, conversely, if p(k;t) < r
it defects.

3. Capital update: As a result of the interaction, the capital of each interacting agent k is
updated as C(k;t) → C(k;t) + $\delta C(k;t)$, where the profit $\delta C(k;t)$ of agent k is one of
the four PD payoffs: R, S, T or P.

4. Assessment of success: Each of the two agents who have just interacted compares its profit
$\delta C(k;t)$ with an estimate e(k;t) of the expected utilities. If $\delta C(k;t) > e(k;t)$
($\delta C(k;t) < e(k;t)$) the agent assumes it is doing well (badly) and therefore that its level
of cooperation is adequate (inadequate).

5. Probability of cooperation update: Seeking to increase their utilities in future PD games, the
agents that just interacted update their p(k;t). If agent k is doing well it increases its
probability of cooperation p(k;t), choosing a uniformly distributed value between p(k;t) and 1. On
the other hand, if agent k is doing badly it decreases its probability of cooperation p(k;t),
choosing a uniformly distributed value between 0 and p(k;t) (see below for a discussion of this
update rule).

Let us see how the estimate e(k;t) emerges naturally. If the interacting agents i and j cooperate
with probabilities $p_i$ and $p_j$ respectively (and defect with probabilities $1 - p_i$ and
$1 - p_j$), then the expected value of the payoff to i, $\delta C^{RSTP}$, is given by:

$$\delta C^{RSTP} = R\,p_i p_j + S\,p_i(1-p_j) + T\,(1-p_i)\,p_j + P\,(1-p_i)(1-p_j). \tag{3}$$

Hence I consider an estimate e(k;t), which involves only the probability of cooperation p(k;t) of
the agent k who uses the estimate, obtained by replacing $p_i$ and $p_j$ in equation (3) by p(k;t):

$$e^{RSTP}(k;t) = (R-S-T+P)\,p(k;t)^2 + (S+T-2P)\,p(k;t) + P. \tag{4}$$
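Stages 1 to 5, together with the estimate of equation (4), can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for this work. The names `estimate` and `simulate` are hypothetical, the two selected agents are assumed to be distinct, r is assumed uniform in [0,1), and an exact tie between profit and estimate (which the stages above do not specify) is assumed to leave p(k;t) unchanged.

```python
import random

def estimate(R, S, T, P, p):
    """Estimate e(k;t) of equation (4) for an agent whose cooperation probability is p."""
    return (R - S - T + P) * p * p + (S + T - 2 * P) * p + P

def simulate(R, S, T, P, n_agents=1000, n_steps=100_000, seed=0):
    """Iterate stages 1-5 and return the lists (p, capital) after n_steps interactions."""
    rng = random.Random(seed)
    p = [rng.random() for _ in range(n_agents)]   # p(i;0) uniform in [0, 1)
    capital = [0.0] * n_agents                    # C(i;0) = 0
    # Payoff to an agent, indexed by (own action, partner's action).
    payoff = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    for _ in range(n_steps):
        i, j = rng.sample(range(n_agents), 2)     # stage 1 (assumes i != j)
        # Stage 2: each agent cooperates if its p exceeds a fresh random number r.
        action = {k: "C" if p[k] > rng.random() else "D" for k in (i, j)}
        profit = {i: payoff[action[i], action[j]],
                  j: payoff[action[j], action[i]]}
        for k in (i, j):
            capital[k] += profit[k]               # stage 3
            e = estimate(R, S, T, P, p[k])        # stage 4 uses the agent's own p only
            if profit[k] > e:                     # doing well: move p towards 1
                p[k] = rng.uniform(p[k], 1.0)
            elif profit[k] < e:                   # doing badly: move p towards 0
                p[k] = rng.uniform(0.0, p[k])
            # profit == e: left unchanged (assumption, see lead-in)
    return p, capital

if __name__ == "__main__":
    # Small demonstration run with the canonical matrix (R, S, T, P) = (3, 0, 5, 1).
    p, capital = simulate(3, 0, 5, 1, n_agents=200, n_steps=20_000)
    print("mean p:", sum(p) / len(p))
```

Because each agent's estimate depends only on its own p(k;t), the order in which the two agents are updated within a time step does not matter.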

While the measure of success seems natural, the updating rule for the probability of cooperation is
quite arbitrary. For instance, for the case of the canonical payoff matrix, the update rule for the
probability of cooperation implies the following: if your partner cooperated, increase your level
of cooperation; else lower it (of course, with boundaries at 0 and 1). A priori, it is not obvious
whether this is a good update rule for maximizing your utilities. After all, the other player might
be a sucker, in which case perhaps you should defect more. However, as we will see in the next
section, this update rule basically works by tuning the agent's cooperation in order to accomplish
some sort of "indirect reciprocity", which in turn produces cooperative equilibrium states.
Furthermore, modifying the strategy of each agent so that it defects more often when it is doing
well, in an attempt to exploit an assumed high percentage of suckers, ends up spoiling cooperation.
A natural implementation of this change of strategy would be: if you are doing well, decrease your
probability of cooperation p with probability proportional to 1 - p and increase it with
probability proportional to p. But this is equivalent to the replicator dynamics [24]-[26], for
which, in ordinary situations, it is known that cooperation becomes extinct.
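The statement about the canonical matrix can be checked directly from equation (4). The following worked step is added for illustration and uses only the canonical values R = 3, S = 0, T = 5, P = 1:

$$e^{3051}(p) = (3-0-5+1)\,p^2 + (0+5-2)\,p + 1 = -p^2 + 3p + 1, \qquad \frac{de^{3051}}{dp} = 3 - 2p > 0 \ \text{ on } [0,1],$$

so the estimate grows monotonically from $e^{3051}(0) = 1$ to $e^{3051}(1) = 3$. The payoffs obtained when the partner cooperates (R = 3 and T = 5) are therefore never below the estimate, and those obtained when the partner defects (S = 0 and P = 1) are never above it. Equality occurs only at the boundaries (R = e at p = 1 and P = e at p = 0), where the strict inequalities of stage 4 leave the outcome unspecified.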

General remarks on the model.

I. Among the weaknesses of the major approaches that have been considered to answer the question of
how cooperation emerges, two are often remarked.

I.A The first criticism concerns the generally assumed "binary" probability of cooperation, i.e.
agents either always cooperate (C) or always defect (D). Clearly, this is not very realistic: the
levels of cooperation of individuals are continuous. Hence a real p(i;t) (in the interval [0,1]) is
used, reflecting the existence of a "gray scale" of levels of cooperation instead of just "black"
and "white".

I.B The second objection concerns the deterministic nature of the algorithms, which seem to fail to
incorporate the stochasticity of agent behavior. The algorithm used here is non-deterministic: the
comparison with the random number r reflects a stochastic component of the agents' behavior.

II. All the agents follow the same universal strategy, which does not evolve over time. However,
the system is adaptive in the sense that the probabilities of cooperation of the agents do evolve.



III. RESULTS 

For all 24 payoff matrices the system self-organizes, after a transient, into equilibrium states
with six values of $\bar{p}_\infty > 0$: 1, 0.56 ± 0.003, 0.5 ± 0.02, 0.42 ± 0.006, 0.22 ± 0.002
and 0.115 ± 0.005. The measurements for the 24 matrices are averaged over 100 simulations of $10^6$
time steps each. Fig. 1 shows the average probability of cooperation vs. time for the different
payoff matrices during the first 200,000 time steps.

Roughly, the asymptotic equilibrium states can be grouped into 3 classes: highly cooperative
($\bar{p}_\infty > 0.5$), moderately cooperative ($\bar{p}_\infty \approx 0.5$) and of low
cooperation ($\bar{p}_\infty < 0.5$). The values of $\bar{p}_\infty$ for the 24 payoff matrices are
listed in the second column of Table 1.



FIG. 1. Curves of $\bar{p}$ vs. time t (horizontal axis up to 2e+05), corresponding to the 24
choices of payoff matrix $M^{RSTP}$. The system self-organizes into 6 different cooperative states
with: $\bar{p}_\infty = 1$ (solid lines), $\bar{p}_\infty \approx 0.56$ (dotted lines),
$\bar{p}_\infty \approx 0.5$ (dashed lines), $\bar{p}_\infty \approx 0.42$ (dot-dashed lines),
$\bar{p}_\infty \approx 0.22$ (+'s) and $\bar{p}_\infty \approx 0.115$ (x's).



R S T P          p_∞      δC_∞

5 3 1 0 (C)      0.425    1.87
5 3 0 1 (C)      0.113    1.15
5 1 3 0 (C)      0.42     1.81
5 1 0 3          0.49     2.16
5 0 3 1          0.22     1.35
5 0 1 3          0.49     2.16
3 5 1 0 (C)      1.0      3.0
3 5 0 1 (C)      1.0      3.0
3 1 5 0          1.0      3.0
3 1 0 5          0.485    2.28
3 0 5 1 (D)      0.5      2.25
3 0 1 5          0.485    2.28
1 5 3 0          0.51     2.22
1 5 0 3 (C)      0.5      2.25
1 3 5 0          0.51     2.22
1 3 0 5          0.495    2.27
1 0 5 3 (D)      0.5      2.25
1 0 3 5 (D)      0.42     2.6
0 5 3 1          0.5      2.22
0 5 1 3          0.5      2.23
0 3 5 1          0.51     2.22
0 3 1 5 (D)      0.495    2.25
0 1 5 3 (D)      0.56     2.1
0 1 3 5 (D)      0.48     2.37

Table 1. Equilibrium values of the probability of cooperation $\bar{p}_\infty^{RSTP}$ and of the
income-per-agent $\delta\bar{C}_\infty^{RSTP}$ for the 24 possible payoff sets {R S T P}. (C) or
(D) in the first column indicates that the dominant strategy is C or D.



The relation between utilities and probability of cooperation 

Let us now analyze the average equilibrium income-per-agent $\delta\bar{C}_\infty$ for the
different payoff matrices. The curves of per-capita income $\delta C^{RSTP}$ as a function of the
average probability of cooperation p are the parabolas obtained by replacing $p_i$ and $p_j$ in
equation (3) by p, i.e.

$$\delta C^{RSTP}(p) = (R-S-T+P)\,p^2 + (S+T-2P)\,p + P. \tag{5}$$

These curves are invariant under the interchange of the sucker's payoff S and the temptation T:

$$\delta C^{RSTP}(p) = \delta C^{RTSP}(p), \tag{6}$$

so the 24 payoff matrices give rise to the 12 different parabolas depicted in Fig. 2. In each
subplot, the values of {RSTP} and {RTSP} and equation (5) for the corresponding parabola are
indicated. For example, we have in the first box: $\delta C^{5310}(p)$ & $\delta C^{5130}(p)$
$= p^2 + 4p$. In all the subplots the equilibrium points ($\bar{p}_\infty$,
$\delta\bar{C}_\infty$) for the payoff matrices with S > T and S < T are denoted, respectively, by
circles and '+'s. Note that, since

$$\delta C^{RSTP}(1/2) = \frac{R+S+T+P}{4} = \frac{9}{4},$$

all the parabolas $\delta C^{RSTP}(p)$ pass through the point (1/2, 9/4). The values of
$\delta\bar{C}_\infty$ are listed in the third column of Table 1 for the 24 payoff matrices.
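Both properties of the parabolas (12 distinct curves, all passing through the point (1/2, 9/4)) can be verified by enumerating the 24 permutations. The snippet below is a small illustrative check in Python; the helper names `parabola` and `income` are not from this work, and exact rational arithmetic is used only to avoid rounding in the comparison.

```python
from fractions import Fraction
from itertools import permutations

def parabola(R, S, T, P):
    """Coefficients (a, b, c) of equation (5): a*p**2 + b*p + c."""
    return (R - S - T + P, S + T - 2 * P, P)

def income(R, S, T, P, p):
    """Per-capita income of equation (5) evaluated at cooperation probability p."""
    a, b, c = parabola(R, S, T, P)
    return a * p * p + b * p + c

matrices = list(permutations((0, 1, 3, 5)))            # the 24 payoff sets (R, S, T, P)
distinct = {parabola(*m) for m in matrices}            # S <-> T swaps collapse here
values_at_half = {income(*m, Fraction(1, 2)) for m in matrices}

print(len(matrices), len(distinct), values_at_half)    # 24 12 {Fraction(9, 4)}
print(parabola(5, 3, 1, 0), parabola(5, 1, 3, 0))      # (1, 4, 0) (1, 4, 0)
```

The last line reproduces the example of the first box of Fig. 2: both $M^{5310}$ and $M^{5130}$ give $p^2 + 4p$.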



FIG. 2. The 12 curves of per-capita income $\delta C(p)$ (vertical axis) vs. p (horizontal axis)
corresponding to the 24 choices of payoffs R, S, T and P in equation (5). The payoff matrices
$M^{RSTP}$ and $M^{RTSP}$ produce the same parabola. Each box indicates the equation of the
parabola and, in parentheses, the corresponding pair of payoff sets which produce it
(RSTP & RTSP). The equilibrium points ($\bar{p}_\infty$, $\delta\bar{C}_\infty$), listed in
Table 1, are marked over the curves with circles for the case S > T and with '+'s for the case
S < T.



Let us analyze the distributions of probabilities of cooperation and their corresponding average
capitals and average income-per-agent. Fig. 3, Fig. 4 and Fig. 5 illustrate, respectively, the
cases of payoff matrices giving rise to equilibrium states with $\bar{p}_\infty > 0.5$,
$\bar{p}_\infty \approx 0.5$ and $\bar{p}_\infty < 0.5$. Measurements are performed over 100
simulations of 50,000 time steps each, after the equilibrium state has been reached, i.e. typically
discarding the first 200,000 configurations (see footnote 2). The upper plots are distributions of
the probabilities of cooperation p using 100-bin histograms. The frequencies $\nu(p)$ are
normalized in such a way that the total area is equal to 1. The middle (lower) plots present the
corresponding average capitals C(p) (average income-per-agent $\delta C(p)$), obtained by taking
the quotients between the histograms for the capitals (income-per-agent) and the frequency
histograms. Fig. 3 corresponds to 2 payoff matrices giving rise to high cooperation: $M^{3150}$ and
$M^{0153}$, with $\bar{p}_\infty^{3150} = 1$ and $\bar{p}_\infty^{0153} \approx 0.56$. The
histograms of frequencies $\nu(p)$ exhibit a peak at p = 1 (in the case of $M^{3150}$ the 3
histograms are non-null only for p = 1). Fig. 4 illustrates two cases of moderate cooperation
produced by payoff matrices $M^{3051}$ (the canonical one) and $M^{5310}$, with
$\bar{p}_\infty^{3051} \approx 0.5$ and $\bar{p}_\infty^{5310} \approx 0.42$. Both histograms
exhibit two peaks, one at p = 0 and one at p = 1.
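The histogram construction described above (frequencies normalized to unit area, and per-bin averages obtained as quotients of histograms) can be sketched as follows. This is a minimal pure-Python illustration; the function name `binned_averages`, the convention that p = 1 falls in the last bin, and the use of `None` for empty bins are assumptions, and the four data points are made up for the demonstration.

```python
def binned_averages(p_values, capitals, n_bins=100):
    """Return (density, mean_capital) over n_bins equal-width bins of [0, 1]."""
    counts = [0] * n_bins
    capital_sums = [0.0] * n_bins
    for p, c in zip(p_values, capitals):
        b = min(int(p * n_bins), n_bins - 1)   # p = 1 is placed in the last bin
        counts[b] += 1
        capital_sums[b] += c
    width = 1.0 / n_bins
    total = len(p_values)
    # Normalized frequencies: the area under the histogram equals 1.
    density = [n / (total * width) for n in counts]
    # Quotient of the capital histogram and the frequency histogram, bin by bin.
    mean_capital = [s / n if n else None for s, n in zip(capital_sums, counts)]
    return density, mean_capital

# Made-up data: two agents near p = 0, one at p = 0.5 and one at p = 1.
density, mean_capital = binned_averages([0.0, 0.004, 0.5, 1.0], [10, 20, 30, 40])
print(sum(density) / 100)                                   # total area, approximately 1
print(mean_capital[0], mean_capital[50], mean_capital[99])  # 15.0 30.0 40.0
```

The same quotient applied to a histogram of per-step profits instead of capitals would give the average income-per-agent in each bin.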



FIG. 3. 100-bin histograms for the highly cooperative payoff matrices $M^{3150}$ (thin line) and
$M^{0153}$ (thick line). Above: distribution of probabilities of cooperation p. The inset is a zoom
showing the smaller peak at p = 1 for $M^{0153}$. Middle: the corresponding average capitals, i.e.
C vs. p. Below: the corresponding average income-per-agent, i.e. $\delta C$ vs. p.

Footnote 2: In the particular case of matrix $M^{3150}$ the system approaches $\bar{p}_\infty = 1$
more slowly, and after 200,000 iterations it has not yet reached equilibrium, as can be seen from
Fig. 1. In that case 450,000 iterations were discarded before measuring.

FIG. 4. 100-bin histograms for the moderately cooperative payoff matrices $M^{3051}$ (thin line)
and $M^{5310}$ (thick line). Above: distribution of probabilities of cooperation p. Middle: the
corresponding average capitals, i.e. C vs. p. Below: the corresponding average income-per-agent,
i.e. $\delta C$ vs. p.



Fig. 5 shows two cases of low cooperation, produced by $M^{5301}$ and $M^{5031}$, with
$\bar{p}_\infty^{5301} \approx 0.113$ and $\bar{p}_\infty^{5031} \approx 0.22$. Both histograms of
frequencies exhibit a peak at p = 0.

FIG. 5. 100-bin histograms for the low-cooperation payoff matrices $M^{5301}$ (thin line) and
$M^{5031}$ (thick line). Above: distribution of probabilities of cooperation p. The inset is a zoom
showing the smaller peak at p = 0 for $M^{5031}$. Middle: the corresponding average capitals, i.e.
C vs. p. Below: the corresponding average income-per-agent, i.e. $\delta C$ vs. p.
Let us summarize the main results which emerge from the data:

• The state of complete cooperation, i.e. the highest possible average cooperation
$\bar{p}_\infty = 1$, is reached for payoff matrices with R = 3: $M^{3510}$, $M^{3501}$ and
$M^{3150}$.

• The payoff matrices with the highest possible reward, R = 5, contrary to what one might think, do
not produce the highest $\bar{p}_\infty$. Moreover, the states of lowest average cooperation are
produced by payoff matrices with R = 5: $\bar{p}_\infty \approx 0.115$ occurs for $M^{5301}$ and
$\bar{p}_\infty \approx 0.22$ for $M^{5031}$. The explanation of this fact lies in the adopted
measure of success based on the estimate e(i;t): from equation (4) and from Fig. 2, note that for
R = 5 the estimates, for high values of p, are e > 3 (i.e. they are greater than all the payoffs
except the reward), making agents excessively demanding. In other words, too large a reward makes
the agents' expectation of utilities so high that it spoils cooperation.

• From the two results above it is clear that there is no simple connection between the dominant
strategy and the equilibrium state. For instance, matrices belonging to class $M_C$ produce both
the highest and the lowest values of $\bar{p}_\infty$.

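• The highest $\delta\bar{C}_\infty$ is obtained from the payoff matrices which produce the highest
$\bar{p}_\infty$, namely $\delta\bar{C}_\infty^{3510} = \delta\bar{C}_\infty^{3150} =
\delta\bar{C}_\infty^{3501} = 3$. On the other hand, the lowest $\delta\bar{C}_\infty$ is obtained
from the payoff matrix which produces the lowest $\bar{p}_\infty$, namely
$\delta\bar{C}_\infty^{5301} \approx 1.15$.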

• The distributions of the probability of cooperation are clearly non-uniform, showing peaks at
p = 0 and/or at p = 1.

• The strategy used by the agents is robust enough to lead, for all the payoff matrices, to
$\bar{p}_\infty > 0$. Furthermore, for the majority of the 24 payoff matrices
$\bar{p}_\infty > 0.5$. This robustness relies on the strategy combining the proposed measure of
success and the update rule for the probability of cooperation. Basically it works by tuning the
agent's cooperation guided by a trade-off between efficiency (increase of utilities) and equity
(indirect reciprocity). If the agent is doing well it behaves nicely and increases its probability
of cooperation. Nevertheless, in future interactions, if its probability of cooperation is
inadequate (too high) and it does badly (it is exploited), then it reacts by decreasing its
cooperation until it starts doing well again.

• The equilibrium states are such that, although the average income-per-agent depends on the value
of the probability of cooperation p, i.e. $\delta C = \delta C^{RSTP}(p)$, the distribution of
average capitals is almost uniform and does not depend on p (as can be observed from Fig. 3 to
Fig. 5). This is consistent with the fact that agents constantly adapt their probability of
cooperation in such a way as to improve their utilities. Hence, for a given value of p, the
utilities of each agent with probability of cooperation p oscillate around e(p) in such a way that
its accumulated capital at a given time (in equilibrium) is independent of p.
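The remark that, for R = 5, the estimate exceeds 3 at high p can be made explicit for the two matrices of lowest cooperation. The following worked step is added for illustration and follows directly from equation (4); since the estimate is symmetric under the interchange of S and T, $M^{5301}$ and $M^{5031}$ share the same estimate:

$$e^{5301}(p) = e^{5031}(p) = (5-3-0+1)\,p^2 + (3+0-2)\,p + 1 = 3p^2 + p + 1,$$

$$e > 3 \iff 3p^2 + p - 2 > 0 \iff (3p-2)(p+1) > 0 \iff p > \tfrac{2}{3}.$$

Thus an agent with p > 2/3 using either of these matrices is satisfied only by the reward R = 5, i.e. only by mutual cooperation; every other outcome is assessed as "doing badly" and lowers its probability of cooperation.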



IV. CONCLUSIONS AND FINAL REMARKS 



The first general conclusion concerns the robustness of the cooperative asymptotic state, which
indicates that, in this model, cooperation seems to be based more on a sort of indirect reciprocity
than on selfish incentives. For example, the permutation of the canonical values of R and T has the
dramatic effect of transforming a society with an intermediate level of cooperation into one
dominated by defection, as can be seen by comparing the results for payoff matrices $M^{3051}$ and
$M^{5031}$. On the other hand, the permutation of the canonical values of S and P also has a
dramatic effect: it transforms a society with an intermediate level of cooperation into a
completely cooperative one, as can be seen by comparing the results for payoff matrices $M^{3051}$
and $M^{3150}$.

An interesting extension of the model would be to allow competition between different strategies in
order to promote their evolution, i.e. players imitate the best-performing ones in such a way that
lower-scoring strategies decrease in number and higher-scoring ones increase. In particular, one
possibility would be to associate different strategies with the use of distinct payoff matrices.
For instance, individuals inclined to cooperate (defect) might be represented by agents using the
payoff matrix $M^{3150}$ ($M^{5301}$), and "neutral" agents by agents using the canonical payoff
matrix $M^{3051}$. This would make it possible to study whether mutants inclined to D can invade a
group of neutral individuals or of individuals inclined to C and drive out all cooperation.
However, a necessary previous step was to know the effect of changing the payoff matrix on the
self-organization of the system, and in particular on the equilibrium point
($\bar{p}_\infty^{RSTP}$, $\delta\bar{C}_\infty^{RSTP}$). So in this work we considered each of
these 24 payoff matrices separately.

This model can be extended in other ways in order to make it more realistic. For instance, here I
considered a MF approximation which neglects all the spatial correlations. One virtue of this
simplification is that it shows that the model does not require agents to interact only with those
within some geographical proximity in order to sustain cooperation. Playing with fixed neighbors is
sometimes considered an important ingredient for successfully maintaining the cooperative regime
[27], [28]. However, the quality of this MF approximation depends on the nature of the system one
wishes to model (people, cultures of bacteria, markets of providers of different services or
products, etc.). Therefore, in order to apply the model to situations in which the effect of
geographic closeness cannot be neglected, an interesting extension of the model would be to
transform the entirely random PD game into a spatial PD game, in which individuals interact only
(or mainly) with those within some geographical proximity.

To conclude, this work is based on the canonical assumption that individuals are entirely
self-interested. However, recent investigations, performed in twelve countries on four continents,
have uncovered systematic deviations from the material payoff-maximizing dogma [29]. In addition to
caring about their own material payoffs, many experimental subjects appear to prefer to share
resources and to undertake costly reciprocal actions in anonymous one-shot interactions. Therefore,
an open issue is how to incorporate this fact into a more realistic model.